Abstract. Recently, great effort has been dedicated to research on
the management of large scale graph based data such as the WWW, social
networks and biological networks. In the study of graph based data
management, the node disjoint subgraph homeomorphism relation between graphs
is more suitable than (sub)graph isomorphism in many cases, especially
in those cases where node skipping and node mismatching are allowed.
However, no efficient node disjoint subgraph homeomorphism determination
(ndSHD) algorithms have been available. In this paper, we propose
two computationally efficient ndSHD algorithms based on state space
searching with backtracking, which employ many heuristics to prune the
search space. Experimental results on synthetic data sets show that the
proposed algorithms are efficient, require relatively little time in most of
the test cases, scale to large or dense graphs, and can accommodate
more complex fuzzy matching cases.



1 Introduction 

Recently, large scale graph based data management has received more and more
research attention, due to the broad application of graph based data. In the study of
graph based data management, graph based pattern matching, i.e., determining whether
the structure of a pattern graph matches that of a data graph, is the key to many
problems in graph data management.

Existing graph pattern matching approaches can be classified into two primary categories:
exact matching and inexact matching. Exact matching requires that the two matched
graphs are isomorphic to each other; i.e., exact graph pattern matching is based on
graph isomorphism relations between graphs. Inexact graph matching, in contrast, is
often considered as subgraph isomorphism between graphs, which means that pattern
graph P matches data graph G if and only if P is subgraph isomorphic to G.

However, in real applications, inexact graph pattern matching based on subgraph
isomorphism cannot represent fuzzy matching in cases where node skipping
or node mismatching is allowed. For example, as shown in Figure 1, although $G_2$ is
not a subgraph of $G_1$, $G_2$ can still be regarded as matching $G_1$ if node skipping or
node mismatching is allowed. In other words, $G_2$ matches $G_1$ from the abstract
topological structure perspective, because $G_2$ retains the abstract topological structure
of $G_1$ if paths in $G_1$ can be contracted into the corresponding edges in $G_2$.

This kind of fuzzy matching is more desirable in many real applications than
subgraph isomorphism based inexact matching. For instance, the discovery of frequent
conserved subgraph patterns from protein interaction networks [1,2] is an important
and challenging task in evolutionary and comparative biology, where 'conserved'
means inexact graph pattern matching that allows node mismatch and node skipping.
Similarly, in social network analysis, the direct connection between nodes usually is
not the focus; instead, the high-level topological structure with independent paths
contracted is of great interest.

Fig. 1. Inexact Matching    Fig. 2. Topological Minor

Using Graph Minor theory [|], the abstract topological structure in many real
applications can be described as a topological minor, and the relation between an abstract
topological structure and its detailed original graph can be described as node/vertex
disjoint subgraph homeomorphism. However, determining whether a pattern graph P
is a topological minor of a data graph G is not trivial; this problem has
been proved to be NP-complete when P and G are not fixed [3]. Although Robertson
and Seymour [3] have proposed a framework to solve the minor containment problem,
which is a generalization of the topological containment problem, and [5] has implemented
the framework, to the best of our knowledge no efficient algorithms have been
dedicated to solving ndSHD (in other contexts also known as topological minor
containment, homeomorphic embedding or topological embedding).

To efficiently determine the node disjoint homeomorphism relation between two
graphs, we propose two algorithms based on state space searching with backtracking, which
integrate many heuristics into the search procedure to prune the search space. The
work in this paper is inspired by Ullmann's [6] subgraph isomorphism determination
(SID) algorithm. However, ndSHD requires additional work. First,
for ndSHD, not only the node mapping space but also the edge-path mapping space needs
to be searched, whereas for SID only the former needs to be searched. Second, for
ndSHD, according to the definition of topological minor, we need to perform pairwise
independence determination of the paths to ensure that the paths are disjoint. Third, for
SID, only edge information is explored, while in ndSHD path information is explored
too, which is a great challenge to the efficiency of the algorithm since the number
of paths is exponential in the size of the graph.

In summary, we make the following contributions in this paper:

1. We propose two efficient algorithms for node disjoint subgraph homeomorphism
determination. To the best of our knowledge, this is the first paper dedicated to
designing practical, efficient algorithms for the node disjoint subgraph homeomorphism
determination or topological minor containment determination problem.

2. We investigate the properties of topological minors, and employ these properties 
as the heuristics to prune the search space. 

3. We present a systematic performance study of the proposed algorithms. The
experimental results show that the algorithms are efficient and scalable on synthetic data
sets.

2 Preliminaries

We begin with some basic notations that are used in [7]. Let $G = (V, E, l)$ be a vertex
labeled graph, where $V$ is the set of vertices, $E \subseteq V \times V$ is the set of edges,
and $l : V \to L$ is a label function giving every vertex a label. (In this paper, we
only focus on vertex labeled graphs. An unlabeled graph can be considered as a labeled
graph with all vertices having the same vertex label.) The vertex set of $G$ is referred
to as $V(G)$, and its edge set as $E(G)$. A path $P$ in a graph is a sequence of vertices
$v_1, v_2, \ldots, v_k$, where $v_i \in V$ and $v_i v_{i+1} \in E$. The vertices $v_1$ and $v_k$ are
linked by $P$ and are called its ends. The number of edges of a path is its length, and a path
of length $k$ is denoted as $P^k$. A path is simple if its vertices are all distinct. In particular,
a group of paths is independent if none of the paths has an inner vertex on another path.
In other words, a path that intersects other paths only at its ends is called
an independent path. Independent paths are the key to studying
topological minors of a graph.
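
The independence condition can be checked mechanically. The following is a minimal Python sketch, assuming that a path is represented as a tuple of vertex identifiers; the helper names `is_simple` and `are_independent` are illustrative and do not come from the paper.

```python
from itertools import combinations

def is_simple(path):
    """A path is simple if all of its vertices are distinct."""
    return len(set(path)) == len(path)

def are_independent(paths):
    """True if no path has an inner vertex that lies on another path."""
    for p, q in combinations(paths, 2):
        # p[1:-1] are the inner vertices of p; the ends are p[0] and p[-1].
        if set(p[1:-1]) & set(q) or set(q[1:-1]) & set(p):
            return False
    return True

# Three paths that meet only at their ends.
print(are_independent([(2, 1, 8), (2, 9, 6), (8, 7, 6)]))  # True
# Both paths use vertex 9 as an inner vertex.
print(are_independent([(2, 9, 6), (2, 9, 8)]))             # False
```

Note that a path whose end is an inner vertex of another path also violates independence, which is why the check compares inner vertices against all vertices of the other path.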

2.1 Topological Minor 

As described in [7], a topological minor of a graph is obtained by contracting the
independent paths of one of its subgraphs into edges. For example, in Figure 2, $X$ is a
topological minor of $Y$, since $X$ can be obtained by contracting the independent paths
of $G$, which is a subgraph of $Y$. Clearly, contracting independent paths helps simplify
a (sub)graph without compromising its abstract topological information.

Formally, as shown in Figure 2, if we replace all the edges of $X$ with independent
paths between their ends, so that these paths are pairwise node independent, namely
none of these paths has an inner vertex on another path, then the resulting graph $G$ is a
subdivision of $X$, denoted as $T(X)$. If $G$ is a subgraph of $Y$, then $X$ is a topological
minor of $Y$. As a subdivision of $X$ and a subgraph of $Y$, if $G$ is obtained by replacing all
the edges of $X$ with independent paths of length from $l$ to $h$, then $G$ is an
$(l, h)$-subdivision of $X$ and $X$ is an $(l, h)$-topological minor of $Y$.

Given two graphs $X$ and $Y$, if $X$ is a topological minor of $Y$, then there exists a
corresponding node disjoint subgraph homeomorphism from $X$ into $Y$, which is a pair
of injective mappings $(f, g)$ from $X$ into $Y$, where $f$ is an injective mapping from the vertex
set of $X$ into that of $Y$ and $g$ is an injective mapping from the edges of $X$ into simple paths
of $Y$ such that (1) for each $e(v_1, v_2) \in E(X)$, $g(e)$ is a simple path in $Y$ with $f(v_1)$ and
$f(v_2)$ as its two ends; (2) all mapped paths are pairwise independent. In other words, if $X$
is node disjoint subgraph homeomorphic to $Y$, every edge of $X$ can be mapped to
a corresponding simple path of $Y$ and all the mapped paths are pairwise independent;
every node in $X$ can be mapped to a corresponding node in $Y$ (the mapped nodes
are called branch nodes of $Y$).

2.2 Problem Definition 

As shown in Figure 3, given two vertex labeled graphs $G_1$ and $G_2$, the minimal
path length $l$ and the maximal path length $h$, the problem is whether $G_1$ is an
$(l, h)$-topological minor of $G_2$, i.e., whether $G_1$ is node disjoint homeomorphic to $G_2$
such that all mapped paths in $G_2$ have length from $l$ to $h$. Obviously, this is a typical
determination (decision) problem. When the answer is true, the homeomorphism mapping $(f, g)$
can also be obtained. The solution to the determination problem can also be extended
to solve the enumeration problem, which is to find all valid homeomorphism
mappings between two graphs.

The answer to the problem is sensitive to the given parameter $(l, h)$. For example, in
Figure 3, if $(l, h)$ is $(2, 2)$, which means that the edges in $G_1$ can only be mapped to paths
in $G_2$ of length 2, then the nodes in $G_1$ can be mapped to the four shaded nodes in $G_2$
and the five edge-path mappings are 12 - 218, 13 - 296, 14 - 234, 23 - 876, 34 - 654. If
$(l, h)$ is $(3, 3)$, $G_1$ is not a topological minor of $G_2$. The influence of the parameter $(l, h)$ on
topological containment determination has been discussed in detail in [8].
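
The mapping quoted above can be checked against the two conditions of Section 2.1. The sketch below is illustrative only: the function name `is_nd_homeomorphism` is hypothetical, and `y_edges` is an assumed stand-in for $G_2$ that contains just the edges used by the five quoted paths (the actual graph of Figure 3 is not reproduced here and may contain more edges).

```python
def is_nd_homeomorphism(x_edges, y_adj, f, g, l=1, h=None):
    """Check conditions (1) and (2) for a candidate pair (f, g)."""
    if len(set(f.values())) != len(f):
        return False                      # f must be injective
    branch = set(f.values())              # branch nodes of Y
    used_inner = set()                    # inner vertices claimed so far
    for u, v in x_edges:
        p = g.get((u, v))
        if not p or len(set(p)) != len(p):
            return False                  # missing path, or path not simple
        if {p[0], p[-1]} != {f[u], f[v]}:
            return False                  # ends must be f(u) and f(v)
        if any(b not in y_adj[a] for a, b in zip(p, p[1:])):
            return False                  # every step must be an edge of Y
        length = len(p) - 1
        if length < l or (h is not None and length > h):
            return False                  # outside the (l, h) window
        inner = set(p[1:-1])
        if inner & branch or inner & used_inner:
            return False                  # paths are not pairwise independent
        used_inner |= inner
    return True

# Assumed data graph: only the edges needed by the quoted mapping.
y_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 1),
           (2, 9), (9, 6)]
y_adj = {}
for a, b in y_edges:
    y_adj.setdefault(a, set()).add(b)
    y_adj.setdefault(b, set()).add(a)

x_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
f = {1: 2, 2: 8, 3: 6, 4: 4}
g = {(1, 2): (2, 1, 8), (1, 3): (2, 9, 6), (1, 4): (2, 3, 4),
     (2, 3): (8, 7, 6), (3, 4): (6, 5, 4)}

print(is_nd_homeomorphism(x_edges, y_adj, f, g, l=2, h=2))  # True
print(is_nd_homeomorphism(x_edges, y_adj, f, g, l=3, h=3))  # False
```

The second call only shows that this particular mapping violates the $(3, 3)$ window; deciding that no mapping exists at all requires a search such as the one described in Section 3.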




Fig. 3. Running Example Fig. 4. Two Level State Space Searching 



3 Algorithm Framework 

To simplify the description of the algorithm, we first give some notations. Assume that
the vertex labeled graph $G_1 = (V_1, E_1, l_1)$ is an $(l, h)$-topological minor of the vertex
labeled graph $G_2 = (V_2, E_2, l_2)$ under the node disjoint subgraph homeomorphism $(f, g)$,
where $f : V_1 \to V_2$ and $g : E_1 \to P^l \cup \ldots \cup P^h$; the image of $E_1$ under the
mapping $g$ is denoted as $g(E_1) = \{g(e) \mid e \in E_1\}$. The numbers of vertices and edges
of $G_1$ and $G_2$ are $n_1, m_1$ and $n_2, m_2$, respectively. For convenience, we call $G_1$ the
minor graph and $G_2$ the data graph; unless explicitly stated otherwise, in the following
discussion $G_1$ always denotes a minor graph and $G_2$ always denotes a data graph.

3.1 A Rudimentary Algorithm 

Determining whether $G_1$ is an $(l, h)$-topological minor of $G_2$ is equivalent to finding a
pair of mappings $(f, g)$ between these two graphs. The mapping $f$ maps the nodes
in $G_1$ to nodes with the same label in $G_2$ so that $g$ can map each edge of $G_1$
to a corresponding path in $G_2$. The final solution of the determination,
i.e., the complete mapping $(f, g)$ between these two graphs, can be described as
$\mathcal{M} = (NM, EPM)$, where $NM \subseteq V_1 \times V_2$ is the node match set and
$EPM \subseteq E_1 \times (P^l \cup \ldots \cup P^h)$ is the edge-path match set. All the mapped
nodes of $G_2$ are denoted as $NM^{(2)}$, and all the mapped paths of $G_2$ are denoted as
$EPM^{(2)}$.

The process of finding the homeomorphism mapping can be suitably described by
means of a State Space Representation [S]. Each state $s$ of the matching process can
be associated with a partial mapping solution $\mathcal{M}_s = (NM_s, EPM_s)$, where $NM_s$
and $EPM_s$ are the node match set and edge-path match set at state $s$, respectively.
$\mathcal{M}_s$ contains all the matches found so far and may become a
subset of some final match set $\mathcal{M}$.

Given the two vertex labeled graphs shown in Figure 3, a naive two level state
space search procedure for a $(2, 2)$ topological mapping is shown in Figure 4, where
the first level finds a suitable node mapping solution (shown in the dotted box
of Figure 4(a)) and the second level finds a suitable edge-path mapping solution
(shown in the dotted box of Figure 4(b)). The corresponding algorithm framework is
as follows.

Algorithm ndSHD1($G_1$, $G_2$, $l$, $h$)

Input: $G_1$, $G_2$: vertex labeled graphs; $l$: minimal path length; $h$: maximal path length.
Output: If $G_1$ is an $(l, h)$-topological minor of $G_2$, return true and the first found
node disjoint subgraph homeomorphism $(f, g)$; otherwise return false.

1. Initial(M); /* Initialize SHD, generate necessary path information, initialize the
   basic data structures */
2. Initial(R);
3. s ← ∅; /* initialize state as empty state */
4. s ← NodeMappingSearch(s, M, R); /* node mapping space search */
5. if not IsValid(s)
6.     return false;
7. else
8.     s ← EdgePathMappingSearch(s); /* edge-path mapping space search */
9.     if not IsValid(s)
10.        return false;
11.    else
12.        return true;

First, we initialize two basic data structures: the node compatible matrix $M$ and the
independent path matrix $R$ together with its associated path index structure. Then we
start the node matching process from the empty state. Each time we select a branch in
the state space, a state $s$ transits to a new successor state $s'$ by adding a new match,
which is a node pair or an edge-path pair, to the partial solution. Each time a new
match state is reached, $M$ and $R$ are updated so that the node mapping space and
edge-path mapping space can be pruned. When a complete node mapping has been found,
the matching process proceeds to the second level: edge-path mapping space search.
Similar to the search process in the node mapping space, each time a branch is selected,
an edge-path pair is added to the partial mapping solution and the independent path
matrix is updated. The process continues until a complete edge-path mapping is found.

In the above search process, if all the possible valid branches in the subspace
rooted at the current state $s$ have been explored but no valid match has been found,
the search process backtracks to the parent state of $s$. Whenever the procedure
enters a dead state, which will be discussed in Section 3.3, the whole process stops and
returns false, which means that the two graphs do not satisfy the $(l, h)$-topological minor
relationship.

3.2 Basic Data Structure 

As described above, we need two basic data structures: one is used to represent
the node mapping information; the other is used to represent the $(l, h)$ independent path
information of $G_2$. For the former, we use the node compatible matrix; for the latter, we use
the independent path matrix together with a path index structure. Both of them change
with the transitions of the matching state.



Fig. 5. $M^0$ and $M'$

Fig. 6. $R$ and its associated Path Indexed Structure



We define the node compatible matrix $M = [m_{ij}]$ to be an $n_1$ (rows) $\times$ $n_2$ (columns)
matrix whose elements are 1's or 0's. At the final success state, we obtain a final
mapping matrix $M' = [m'_{ij}]$ whose elements are 1's or 0's, such that each row contains
exactly one 1 and each column contains no more than one 1. The final mapping matrix
represents a valid one-to-one mapping between nodes of $G_1$ and $G_2$, while the initial
compatible matrix $M^0$ represents the possible mappings between nodes of $G_1$ and $G_2$.
The initial node mapping and the final node mapping between $G_1$ and $G_2$ in Figure 3
are shown in Figure 5. For each element $m'_{ij}$ of $M'$, $(m'_{ij} = 1) \Rightarrow (m^0_{ij} = 1)$.

Clearly, reducing the number of 1's in $M$ is the key to speeding up the search procedure
in the node mapping space. Hence, the first key step is to construct an initial compatible
matrix $M^0$ with as few 1's as possible. For this reason, we first introduce Lemma 1.
Due to space limitations, the detailed proof is omitted in this paper.

Lemma 1. The number of elements of an independent path set starting from a specified
vertex $v_i \in V$ is no more than $d(v_i)$, where $d(v_i)$ denotes the degree of $v_i$.

Since every path starting from $v_i$ necessarily passes through one of the edges
incident with $v_i$, and two independent paths cannot share such an edge, an independent
path set starting from $v_i$ has at most $d(v_i)$ elements.
According to Lemma 1, a node $v$ in $G_1$ cannot be matched to those nodes in $G_2$ whose
degree is less than $d(v)$. Therefore, we construct the initial compatible matrix $M^0$ in
accordance with the following rule: $m^0_{ij} = 1$ if $l_1(v_i) = l_2(v_j) \wedge d(v_i) \le d(v_j)$,
otherwise 0. As shown in Figure 3, $v_1$ in $G_1$ cannot be mapped to $v_5$ of $G_2$, although
these two nodes have the same label.
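
The construction rule for $M^0$ translates directly into code. The following minimal sketch assumes that each graph is given as an adjacency dictionary (node to set of neighbours) plus a label dictionary; the function name `initial_compat_matrix` and the tiny example graphs are illustrative, not taken from the paper.

```python
def initial_compat_matrix(adj1, label1, adj2, label2):
    """Build M0: entry is 1 iff labels agree and d(v_i) <= d(v_j)."""
    nodes1, nodes2 = sorted(adj1), sorted(adj2)
    return [[1 if label1[u] == label2[v] and len(adj1[u]) <= len(adj2[v]) else 0
             for v in nodes2]
            for u in nodes1]

# Minor graph: node 1 (label 'a', degree 2) joined to two 'b' leaves.
adj1 = {1: {2, 3}, 2: {1}, 3: {1}}
label1 = {1: 'a', 2: 'b', 3: 'b'}
# Data graph: a star centred at node 2.
adj2 = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2}}
label2 = {1: 'a', 2: 'a', 3: 'b', 4: 'b'}

for row in initial_compat_matrix(adj1, label1, adj2, label2):
    print(row)
# [0, 1, 0, 0]   node 1 cannot use data node 1: same label, but degree 1 < 2
# [0, 0, 1, 1]
# [0, 0, 1, 1]
```

A column of $M^0$ that contains only 0's identifies a data node that can never be a branch node.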

When constructing the Independent Path Matrix and its associated Path Indexed
Structure, the first question we face is whether we need to generate all the $(l, h)$ path
information of $G_2$. The answer is no, based on the following lemma.

Lemma 2. If $G_1$ is an $(l, h)$-topological minor of $G_2$ under the subgraph homeomorphism
$(f, g)$, then $g(E_1)$ only contains paths ending at branch nodes in $G_2$.

For example, in Figure 3, since $v_5$ in $G_2$ cannot be a branch node, the paths
starting from $v_5$ need not be enumerated. However, note that paths having $v_5$ as an
inner vertex cannot be ignored.
Therefore, we only need to enumerate all the $(l, h)$ paths between all candidate
branch node pairs. These candidate branch nodes can be identified using the matrix $M^0$.
As shown in Figure 3, since columns 5 and 9 contain only 0's, $v_5$ and $v_9$ in $G_2$ cannot
be branch nodes and can thus be filtered out; the remaining nodes in $G_2$ are the
candidate branch nodes. The cardinality of the candidate branch node set is denoted
as $n'_2$.

Then, we can define the independent path matrix $R = [r_{ij}]$ to be an $n_2$ (rows) $\times$ $n'_2$
(columns) matrix whose elements are non-negative integers, each of which represents the
number of $(l, h)$ paths between the node pair $(v_i, v_j)$ in $G_2$. The corresponding detailed
path information is stored in a list array RLists, where each list in RLists contains
the path addresses that point to the physical storage of the paths. RLists
can be considered as a path index structure that is built according to the end vertex
pair of each path.
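
One possible way to build such an index is a bounded depth-first enumeration from every candidate branch node. The sketch below is an assumption about how this could be done, not the paper's implementation: it stores $R$ as a dictionary keyed by ordered end pairs rather than as a matrix, stores the paths themselves rather than addresses, and the name `build_path_index` is hypothetical.

```python
from collections import defaultdict

def build_path_index(adj, candidates, l, h):
    """Return (R, RLists) for simple paths of length l..h between candidates."""
    rlists = defaultdict(list)

    def extend(path):
        length = len(path) - 1
        last = path[-1]
        # Record the path if it ends at a candidate and is long enough.
        if length >= max(l, 1) and last in candidates:
            rlists[(path[0], last)].append(tuple(path))
        if length == h:
            return                        # upper bound on path length reached
        for nxt in sorted(adj[last]):
            if nxt not in path:           # keep the path simple
                path.append(nxt)
                extend(path)
                path.pop()

    # Only candidate branch nodes are used as starting points (Lemma 2);
    # non-candidates may still appear as inner vertices.
    for start in sorted(candidates):
        extend([start])
    counts = {pair: len(paths) for pair, paths in rlists.items()}
    return counts, dict(rlists)

# A 4-cycle 1-2-3-4-1 in which only nodes 1 and 3 are candidates.
adj = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
R, RLists = build_path_index(adj, {1, 3}, l=2, h=2)
print(R[(1, 3)])        # 2
print(RLists[(1, 3)])   # [(1, 2, 3), (1, 4, 3)]
```

In this sketch each undirected path is indexed once per direction, so the resulting counts are symmetric in the two end nodes.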

3.3 State Space Searching 

The procedures of node mapping space search and edge-path mapping space search
are similar to each other. They are shown together as follows.

Algorithm Node/EdgePathMappingSearch1(s, M, R)

Input: s: the current matching state; M: the current node compatible matrix; R: the
current independent path matrix.

Output: found: a boolean variable indicating whether a complete node/edge-path
mapping has been found.

1. if (s is dead state)
2.     return false;
3. if (s is complete mapping state)
4.     return true;
5. let found ← false
6. while (not found && Exists Valid node/edge-path Mapping Pair)
7.     m ← GetNextNodePair(); /* m ← GetNextEdgePathPair(); */
8.     s' ← BackupState(s);
9.     NM_s ← NM_s ∪ {m}; /* EPM_s ← EPM_s ∪ {m} */
10.    Refine(M, R);
11.    found ← Node/EdgePathMappingSearch1(s, M, R);
12.    if (found)
13.        return true;
14.    else
15.        s ← RecoverState(s');
16. return false;

From lines 1-4, we can see that when a new state $s$ is reached, $s$ can be a dead state
or a success state (complete mapping state). The state space search arrives at a success
state if all the node mappings or edge-path mappings have been found, which means
$|NM_s| = |V_1|$ or $|EPM_s| = |E_1|$, where $NM_s$ and $EPM_s$ are the node match set and
edge-path match set at state $s$. The node mapping state space search arrives at a dead
state if there is a row with all 0's in the node compatible matrix $M$ of the current state,
i.e., $\exists i$, $|NM_s| < i \le n_1$, s.t. $\sum_{1 \le j \le n_2} m_{ij} = 0$. The edge-path mapping
state space search arrives at a dead state if there is no path between some pair of branch
nodes whose preimages are adjacent in $G_1$, i.e.,
$\exists i, j$ s.t. $(f^{-1}(node(i)), f^{-1}(node(j))) \in E_1$ and $r_{ij} = 0$, where $node(i)$ gets the
vertex corresponding to the $i$th column in matrix $R$; this can be easily determined
from the independent path matrix $R$ of the current state.
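
Both dead-state tests are simple scans over the current data structures. The sketch below is illustrative and rests on a few assumptions: $M$ is a list of 0/1 rows, the matched rows are tracked as a set of row indices (the paper instead assumes rows are matched in order), $R$ is a dictionary from end-node pairs of $G_2$ to path counts, and `f` is the partial node mapping. The function names are hypothetical.

```python
def node_search_is_dead(M, matched_rows):
    """True if some still-unmatched row of M contains only 0's."""
    return any(i not in matched_rows and not any(row)
               for i, row in enumerate(M))

def edge_search_is_dead(edges1, f, matched_edges, R):
    """True if an unmatched edge of G1 with both ends mapped has no path left."""
    for u, v in edges1:
        if (u, v) in matched_edges or u not in f or v not in f:
            continue                      # already matched, or ends not mapped yet
        if R.get((f[u], f[v]), 0) == 0:
            return True                   # no (l, h) path between f(u) and f(v)
    return False

# Row 1 is unmatched and has no candidate left, so the state is dead.
print(node_search_is_dead([[0, 1, 0], [0, 0, 0]], matched_rows={0}))   # True
# Edge (1, 2) still has two candidate paths between data nodes 2 and 8.
print(edge_search_is_dead([(1, 2), (1, 3)], {1: 2, 2: 8}, set(), {(2, 8): 2}))  # False
```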

Whenever the search process enters a success state or a dead state, the procedure
terminates. If a success state is reached, the complete mapping has been found and the
procedure returns true. If a dead state is reached, the procedure returns false. In any other
case, the procedure continues exploring the state space. Lines 1-11 describe this process.

Assume that the process reaches a state $s$ that is only a partial solution. Then as long
as there exists a valid mapping pair, i.e., a node pair or an edge-path pair, we
generate a new state by adding the new match (line 9) to the existing solution $\mathcal{M}_s$.
To enable backtracking, we need to back up the current state first (line 8), including
the node compatible matrix, the independent path matrix, etc., because after a new match
is added to the current solution, these two basic data structures will be refined to prune
the remaining mapping space (line 10). Then the depth-first search (DFS) continues until
the search enters a dead state or a success state. If we cannot find a success state in the
subtree rooted at $s$, we recover the state $s$ (line 15) and try the sibling branches.
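
To make the two level search concrete, the following is a deliberately naive Python sketch of the backtracking scheme. It is not ndSHD1 itself: it omits the matrices $M$ and $R$ and the refinement step, recomputes candidate paths on demand, and only applies the label and degree filter when choosing node pairs. The function name `ndshd_naive` and the example graphs are assumptions made for illustration.

```python
def ndshd_naive(adj1, lab1, adj2, lab2, l, h):
    """Return (f, g) if G1 is an (l, h)-topological minor of G2, else None."""
    nodes1 = sorted(adj1)
    edges1 = sorted({tuple(sorted((u, v))) for u in adj1 for v in adj1[u]})

    def paths(a, b):
        """All simple paths from a to b in G2 whose length lies in [l, h]."""
        found = []
        def extend(path):
            if path[-1] == b:
                if len(path) - 1 >= l:
                    found.append(tuple(path))
                return
            if len(path) - 1 == h:
                return
            for w in sorted(adj2[path[-1]]):
                if w not in path:
                    extend(path + [w])
        extend([a])
        return found

    def map_edges(i, f, g, used):
        if i == len(edges1):                  # success state
            return True
        u, v = edges1[i]
        for p in paths(f[u], f[v]):
            inner = set(p[1:-1])
            if inner & used:                  # would break independence
                continue
            g[(u, v)] = p                     # add the edge-path match
            if map_edges(i + 1, f, g, used | inner):
                return True
            del g[(u, v)]                     # recover the previous state
        return False                          # all branches exhausted

    def map_nodes(i, f):
        if i == len(nodes1):                  # complete node mapping: go to level two
            g = {}
            ok = map_edges(0, f, g, set(f.values()))
            return (dict(f), g) if ok else None
        u = nodes1[i]
        for v in sorted(adj2):
            if v in f.values() or lab1[u] != lab2[v] or len(adj1[u]) > len(adj2[v]):
                continue
            f[u] = v                          # add the node match
            result = map_nodes(i + 1, f)
            if result:
                return result
            del f[u]                          # backtrack
        return None

    return map_nodes(0, {})

# A triangle with unique labels searched in a labeled 6-cycle.
adj1 = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}
lab1 = {'a': 'A', 'b': 'B', 'c': 'C'}
adj2 = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
lab2 = {0: 'A', 1: 'X', 2: 'B', 3: 'X', 4: 'C', 5: 'X'}

print(ndshd_naive(adj1, lab1, adj2, lab2, 2, 2) is not None)  # True
print(ndshd_naive(adj1, lab1, adj2, lab2, 1, 1))              # None
```

Because nothing is pruned ahead of time, this sketch explores far more of the state space than the refined procedures described next; it is meant only to show where the backup and recovery steps sit in the recursion.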

3.4 Refinement Procedure 

Traversing all possible mapping branches is time consuming, so search space pruning is
essential for ndSHD. For this purpose, we devise two refinement procedures, on $R$ and
$M$ respectively; the correctness of the former refinement is based on Lemmas 3 and 4, and
that of the latter on Lemma 5.

Lemma 3. In the matching process, let $s$ be the current state. If $v \in NM_s^{(2)}$ and
$\mathcal{M}_s$ will be a partial solution of some final solution $\mathcal{M}$, then any path with
$v$ as an inner vertex will not be in $EPM^{(2)}$.

If $v \in NM_s^{(2)}$ and $NM_s^{(2)}$ will be a subset of some final solution, then $v$ will be a
branch node of $G_2$. Since branch nodes can only be end vertices of the final independent
path set, any path with $v$ as an inner vertex will not belong to the final independent
path set.

Lemma 4. In the matching process, let $s$ be the current state. If $p \in EPM_s^{(2)}$ and
$\mathcal{M}_s$ will be a partial solution of some final solution $\mathcal{M}$, then any other path
passing through an inner vertex of $p$ will not be in $EPM^{(2)}$.

If $p \in EPM_s^{(2)}$, then every other path passing through an inner vertex of $p$
intersects $p$ at that vertex, so none of these paths can occur in $EPM^{(2)}$.

Lemma 3 implies that, if a vertex $v$ in $G_2$ is added to the existing node match set,
all the paths with $v$ as an inner vertex can be removed from RLists and the
corresponding elements in $R$ can be decreased accordingly. Lemma 4 implies that if we
reach a new state by adding a new edge-path pair $(e_{1i}, p_{2i})$, all the other paths passing
through any inner vertex of $p_{2i}$ can be removed from RLists and the corresponding
elements in $R$ can be reduced accordingly.

As shown in Figure 3, if vertex $v_2$ in $G_1$ is mapped to $v_8$ in $G_2$, any path passing
through $v_8$ as an inner vertex can be removed from RLists; thus the potential edge-path
mapping space is pruned. As shown in Figure 4, when (13, 296) is added to the partial
solution, the subtree rooted at node (13 - 296) is reduced, in that all the branches
containing paths passing through vertex $v_9$ are pruned.
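
The two refinements on the path index can be sketched as simple filters. This is a minimal illustration under stated assumptions: RLists is modelled as a dictionary from end-node pairs to lists of path tuples, the entries of $R$ are recomputed as list lengths, and the function names are hypothetical. The sample index contains only paths that the running example mentions.

```python
def refine_after_node_match(rlists, v):
    """Lemma 3: drop every indexed path that has v as an inner vertex."""
    return {pair: [p for p in paths if v not in p[1:-1]]
            for pair, paths in rlists.items()}

def refine_after_path_match(rlists, matched):
    """Lemma 4: drop every other path that touches an inner vertex of matched."""
    inner = set(matched[1:-1])
    keep = {matched, matched[::-1]}       # the matched path itself, either direction
    return {pair: [p for p in paths if p in keep or not inner & set(p)]
            for pair, paths in rlists.items()}

def counts(rlists):
    """Recompute the entries of R from the refined lists."""
    return {pair: len(paths) for pair, paths in rlists.items()}

rlists = {(2, 8): [(2, 1, 8), (2, 9, 8)],
          (2, 6): [(2, 9, 6)],
          (8, 6): [(8, 7, 6)]}

# Adding the edge-path pair (13, 296) removes the other path through vertex 9.
after = refine_after_path_match(rlists, (2, 9, 6))
print(counts(after))   # {(2, 8): 1, (2, 6): 1, (8, 6): 1}
```

Returning new dictionaries instead of mutating in place keeps the previous index available, which is one simple way to support the backup and recovery steps of the search.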
Lemma 5. In the matching process, let $s$ be the current state. If $(v_i, v_j) \in NM_s$
($v_i \in V_1$, $v_j \in V_2$) and $\mathcal{M}_s$ will be a partial solution of some final solution
$\mathcal{M}$, then the following statements hold true:

1. $\prod_k r_{j'k} > 0$, where $j' = Index(v_j)$, $k \in Index(V')$ and
$V' = \{v_2 \mid v_1 \in Adjacent(v_i) \wedge (v_1, v_2) \in NM_s\}$.

2. $\forall v' \in V''$, $\exists v \in V_2$ such that $l_2(v) = l_1(v')$ and $r_{j'k} > 0$, where
$j' = Index(v_j)$, $k = Index(v)$ and $V'' = \{v' \mid v' \in Adjacent(v_i) \cap (V_1 - NM_s^{(1)})\}$.

3. The path set consisting of the paths indicated by all the $r_{j'k}$'s mentioned in (1) and (2)
is independent.

In the above statements, the function $Index(v)$ gets the index in $R$ of a node $v$ in
$G_2$, and $Adjacent(v)$ obtains the set of nodes adjacent to $v$.

Suppose that the partial solution of the current state will grow into a final successful
solution. Lemma 5 implies that two nodes $v_i \in V_1$ and $v_j \in V_2$ are compatible only if
the three conditions are satisfied; i.e., if any one of them is not satisfied, $m_{ij}$ in the node
compatible matrix at state $s$, namely $M^s$, can be refined to 0.

As shown in Figure 3, assume that the current matching state is $s$ and
$NM_s = \{(1, 2), (2, 8)\}$. Condition 1 implies that if (3, 7) can be added to $NM_s$, i.e.,
$v_3$ in $G_1$ can be mapped to $v_7$ in $G_2$, there must exist two independent paths from
$v_7$ to $v_2$ and $v_8$ in $G_2$; otherwise $m_{37}$ can be refined to 0, and thus the node mapping
space can be pruned. Moreover, since $v_3$ and $v_4$ are adjacent in $G_1$, there must exist a
corresponding path in $G_2$ from $v_7$ to some node with the same label as $v_4$ of $G_1$;
otherwise $m_{37}$ can be refined to 0, which is stated in condition 2. Furthermore, all the
above paths must be node disjoint, which is implied by condition 3. If $(l, h)$ is set as
$(2, 2)$, the paths connecting $v_7$ to $v_2$, $v_8$ and the candidate for $v_4$ all pass through one
common node. Hence $m_{37}$ in matrix $M^s$ can be refined to 0.

3.5 More Efficient Searching Strategy 

A basic observation about the above refinement procedures is that the constraint resulting
from an edge-path match is more restrictive than that resulting from a node
match. Hence, a better strategy is to try edge-path matches as early as possible, instead
of performing edge-path matching only after a complete node match has been found. We
denote these two strategies as $s_1$ (old strategy) and $s_2$ (new strategy), respectively; the
algorithms employing the two strategies are denoted as ndSHD1 and ndSHD2, respectively.
Intuitively, in ndSHD2 the search procedure meets a dead state very early
if the current search path will not lead to a successful mapping solution, and thus the
search procedure quickly backtracks to try another mapping solution.

As an example, assume that $(l, h)$ is set as $(2, 2)$ and the current matching state
is $s$ such that $NM_s = \{(1, 2), (2, 8)\}$. Since $v_1$ and $v_2$ are adjacent in $G_1$, why not
try to match a path in $G_2$ to the edge $e(v_1, v_2)$? If we do so, there are only two
suitable edge-path pairs, (12, 298) and (12, 218). Once (12, 298) has been added
to $EPM_s$, all paths in $G_2$ passing through $v_9$ are excluded from $R$ and RLists; thus
the search space is pruned early. Furthermore, we can see that this
partial mapping solution cannot be part of a final successful solution, so every
solution containing this partial solution as a subset is pruned. If we instead try the
edge-path pair (12, 218), we eventually find a successful complete mapping solution.

The framework of ndSHD2 is similar to that of ndSHD1 and is omitted
here. The detailed procedure of ndSHD2 is shown in NodeMappingSearch2 and
EdgePathMappingSearch2. Note that, in NodeMappingSearch2, after a new node match
$(v_i, v_j)$ has been added to $NM_s$, we obtain an edge set consisting of the edges that
connect $v_i$ to any vertex already in $NM_s^{(1)}$ (line 11), namely
$E = \{(v_i, u) \mid u \in NM_s^{(1)} \cap Adjacent(v_i)\}$. If $E = \emptyset$ (for a connected graph,
this happens only at the initial stage, for the first node match), we continue the node
mapping search procedure (lines 12-13); otherwise, we switch to the edge-path mapping
search procedure to find a valid path in $G_2$ for each edge in $E$ (lines 14-15). In
EdgePathMappingSearch2, if we can find valid paths for all the edges in $E$, the edge-path
mapping search procedure returns true (lines 3-4) and we return to the node mapping
space search (line 13); otherwise, we continue the edge-path mapping search procedure
(the while body).

Algorithm NodeMappingSearch2(s, M, R)

Input and Output are the same as those of NodeMappingSearch1.

1. if (s is complete mapping state)
2.     return true;
3. if (s is dead state)
4.     return false;
5. let found ← false
6. while (not found && Exists Valid node Mapping Pair)
7.     m ← GetNextNodePair(); /* Get a next valid node pair */
8.     s' ← BackupState(s);
9.     NM_s ← NM_s ∪ {m};
10.    Refine(M, R);
11.    E ← NewEdgeEmergent(s, G_1)
12.    if (E = ∅)
13.        found ← NodeMappingSearch2(s, M, R);
14.    else
15.        found ← EdgePathMappingSearch2(s, M, R, E);
16.    if (found)
17.        return true;
18.    else
19.        s ← RecoverState(s');
20. return found;

Algorithm EdgePathMappingSearch2(s, M, R, E)

Input and Output are the same as those of EdgePathMappingSearch1 except E, which is
the set of edges in $G_1$ induced by $NM_s^{(1)}$.

1. if (s is dead state)
2.     return false;
3. if (s is complete mapping state with respect to E)
4.     return true;
5. let found ← false
6. while (not found && Exists Valid edge-path Mapping Pair)
7.     m ← GetNextEdgePathPair(); /* Get a next valid edge-path pair */
8.     s' ← BackupState(s);
9.     EPM_s ← EPM_s ∪ {m};
10.    Refine(M, R);
11.    found ← EdgePathMappingSearch2(s, M, R, E);
12.    if (found)
13.        found ← NodeMappingSearch2(s, M, R);
14.    else
15.        s ← RecoverState(s');
16. return false;



4 Experimental Evaluation 

To test the efficiency of the algorithms, we generate synthetic data sets according to
the random graph model [10], which links each node pair with probability $p$. All generated
graphs are vertex labeled, undirected, connected graphs. We also randomly label every
node so that the vertex labels are uniformly distributed. We implemented the algorithms
in C++ and carried out our experiments on a Windows 2003 server machine with an Intel
2GHz CPU and 1G main memory.
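
A generator of this kind can be sketched in a few lines. The following Python sketch is an illustration, not the authors' C++ generator: the function names are hypothetical, and the paper does not state how connectivity was enforced, so the sketch only provides a connectivity test that a caller could use to reject disconnected samples. In this model the expected average degree is $p(n - 1)$, so a target average degree $d$ corresponds to $p = d / (n - 1)$.

```python
import random

def random_labeled_graph(n, p, num_labels, seed=None):
    """Link each node pair with probability p and label nodes uniformly."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    labels = {v: rng.randrange(num_labels) for v in range(n)}
    return adj, labels

def is_connected(adj):
    """Depth-first reachability test from an arbitrary start node."""
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)

# Example: 100 nodes, target average degree 4, 20 labels.
adj, labels = random_labeled_graph(100, 4 / 99, 20, seed=1)
print(len(adj), is_connected(adj))
```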

The efficiency of the algorithms is influenced by the following factors: $N_1$: node
size of $G_1$; $N_2$: node size of $G_2$; $M_1$: average degree of $G_1$; $M_2$: average degree of
$G_2$; $(L, H)$: the minimal and maximal path length. The efficiency can also be influenced by
the number of vertex labels. A large number of labels exerts a strong constraint
on the initial node compatible matrix and thus reduces the runtime significantly.
For determination algorithms, the runtime may also be influenced by the answer to
the determination. Generally, if the answer is false, in the worst case the algorithm
may need to traverse the entire mapping space, which is very time consuming. If the
answer is true, then in the best case the algorithm may only need to try one complete
match procedure. However, in the following experiments, we can see that the result of the
determination has limited impact on the runtime, which can partly be attributed to
the strong pruning ability of the refinement procedures. Due to these pruning techniques,
even if the result is false, the procedure backtracks as early as possible, and thus the overall
runtime is rarely affected.




Fig. 7. Efficiency and scalability with respect to the growth of the size of the data graph
($G_2$): (a) ndSHD1, (b) ndSHD2; horizontal axis: size of $G_2$ (×20). The insets of (a) and (b)
show the runtime of all running cases in which the determination result is true and
false, respectively.
First, we demonstrate the scalability with respect to the growth of the number of
nodes of the data graph G2 via experiment Exp1. We use a complete graph with 4 uniquely
labeled nodes, denoted as C4, as the minor graph, and generate 200 data graphs G2
with the number of nodes varying from 20 to 4000 in increments of 20. The average degree
of each data graph is fixed at 4, and the nodes of each graph are randomly assigned one of
20 labels. L and H are fixed at 1 and 3, respectively, meaning that the path length
is in the range [1,3]. Thus the parameters can be denoted as N1=4, M1=3, M2=4, L=1, H=3.
From Figure 7 we can see that ndSHD1 and ndSHD2 are both approximately linearly
scalable with respect to the number of nodes in G2, irrespective of the
determination result. Notice that for G2 with about 4000 nodes and 8000 edges, the
worst-case runtime of ndSHD1 is no more than 13s and that of ndSHD2 is no more than
7s.
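
The construction of the Exp1 data graphs can be sketched as follows. This is only a
minimal illustration under stated assumptions: the graphs are taken to be undirected and
simple, edges are sampled uniformly at random, and the function name
random_labeled_graph is hypothetical; the generator actually used in the experiments is
not specified in this paper.

```python
import random


def random_labeled_graph(num_nodes, avg_degree, num_labels, seed=0):
    """Return (edges, labels) for a random undirected simple graph.

    Assumption: uniform random edge sampling; this is an illustrative
    generator, not the one used to produce the reported results.
    """
    rng = random.Random(seed)
    # In an undirected graph the sum of degrees equals 2 * |E|.
    num_edges = num_nodes * avg_degree // 2
    if num_edges > num_nodes * (num_nodes - 1) // 2:
        raise ValueError("average degree too large for a simple graph")
    edges = set()
    while len(edges) < num_edges:
        u, v = rng.sample(range(num_nodes), 2)  # two distinct endpoints
        edges.add((min(u, v), max(u, v)))       # store each edge once
    # Every node receives one of num_labels labels uniformly at random.
    labels = [rng.randrange(num_labels) for _ in range(num_nodes)]
    return edges, labels


# Node counts of the 200 data graphs: 20, 40, ..., 4000.
sizes = list(range(20, 4001, 20))
edges, labels = random_labeled_graph(sizes[-1], avg_degree=4, num_labels=20)
print(len(sizes), len(edges), len(labels))
```

With 4000 nodes and average degree 4, this sketch yields 8000 edges, which agrees with
the largest data graph described above.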

Exp2 is designed to show the scalability of ndSHD1 and ndSHD2 with respect to
the size of G1, where we fix some parameters as M1=4, N2=4k, L=1, H=3 and vary the size of
G1 from 6 to 82 in increments of 4 to generate 20 minor graphs. Each minor graph
is uniquely labeled, meaning that the number of labels equals the number of nodes. Two
data graphs are used, one with average degree M2 = 8 and the other with M2 = 20. The nodes
of these two data graphs are randomly assigned one of 200 labels. Figure 8(a) and (b) show
the results with M2 set to 8 and 20, respectively; these two experiments are denoted
as Exp2_1 and Exp2_2, respectively. The determination results of all running cases shown in
Figure 8(a) are false due to the relative sparsity of G2, and the determination results of
all running cases shown in Figure 8(b) are true due to the relatively higher density of G2.
As can be seen, ndSHD1 and ndSHD2 are both approximately linearly scalable with
respect to the number of nodes in G1. We can also see that, when G2 is sparse, the
difference in performance between ndSHD1 and ndSHD2 is too small to be
discerned; whereas when G2 is denser, the running time of ndSHD1 is not available,
meaning that all running cases need more than one hour, while the runtime increase
of ndSHD2 is not very substantial.




Fig. 8. Scalability with respect to the size of G1.

Fig. 9. Scalability with respect to the upper bound of path length.

Exp3 is designed to show the scalability with respect to the growth of the density of
G2; the parameters are fixed as N1=6, M1=5, N2=1k, L=1, H=3. The minor graph is uniquely
labeled; the nodes of the data graphs are randomly assigned one of 20 labels. Table 1 shows
the running time when we vary M2 from 2 to 20 in increments of 1. As can be seen, the
runtimes of ndSHD1 and ndSHD2 both increase approximately linearly with the growth of M2.
However, we must note that for ndSHD1 there exist some outliers that consume too much
time; e.g., when M2 = 17, more than 10 minutes are needed, and when M2 = 16 the running
time is not available. Compared to ndSHD1, ndSHD2 is more stable.



Table 1. Running time of Exp3

Figure 9 shows the runtime of the algorithms with respect to (L, H). The parameters of
this experiment (denoted as Exp4) are set as N1=4, M1=3, N2=1k, M2=8, L=1. The vertex
labeling of the minor graph and data graphs is the same as in Exp3. As can be seen, the
broader the range is, the longer the running time is; the runtimes of ndSHD1 and ndSHD2
both increase dramatically with the growth of the upper bound of the path length. However,
the runtime of ndSHD2 grows more slowly than that of ndSHD1, which implies that ndSHD2
is more efficient than ndSHD1 for larger H. The superlinear growth of the
runtime with the increase of the upper bound of the path length can be partly attributed to
the exponential growth of the number of potential mapped paths. Fortunately, in real
applications, a large upper bound is too unrestrictive when performing fuzzy matching
on graph data, so upper bounds less than 3 are usually used.
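
The exponential growth of the number of candidate paths can be made concrete with a
rough estimate. The sketch below rests on simplifying assumptions that are not part of
the experiments: every node is taken to have exactly d neighbours and a path is only
forbidden from stepping straight back, so that paths of length k starting from one node
number about d(d-1)^(k-1). The function name estimated_paths is hypothetical.

```python
def estimated_paths(avg_degree, low, high):
    """Rough number of candidate paths of length low..high from one node.

    Assumption: a regular graph of degree avg_degree in which a path never
    immediately returns to the node it just left; longer cycles are ignored,
    so this is an estimate, not an exact count of simple paths.
    """
    d = avg_degree
    return sum(d * (d - 1) ** (k - 1) for k in range(low, high + 1))


# Exp4 setting: average degree M2 = 8, lower bound L = 1, varying H.
for h in range(1, 6):
    print(h, estimated_paths(8, 1, h))
```

Under these assumptions the estimate for M2 = 8 and L = 1 rises from 8 candidate paths
at H = 1 to 456 at H = 3 and 22408 at H = 5, which is in line with the superlinear growth
of the runtime discussed above.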




Fig. 10. Effect of the number of labels in G2 on the performance of ndSHD1 and
ndSHD2. The horizontal axis is the number of labels in G2.

To examine the impact of the number of vertex labels on the performance of ndSHD1
and ndSHD2, we use a uniquely labeled graph with 6 nodes and 15 edges as the minor
graph and a graph with 1000 nodes and 4000 edges as the data graph. We randomly label
the data graph with 10 to 200 labels in increments of 10 to generate 20 differently
labeled data graphs. L and H are set to 1 and 3, respectively. The result of this
experiment (denoted as Exp5) is shown in Figure 10. Clearly, the runtimes of ndSHD1
and ndSHD2 decrease substantially with the growth of the number of labels of G2, which
conforms to our expectation, since a larger number of labels in G2 reduces
the node mapping space between the minor graph and the data graph. We can also see that
ndSHD2 outperforms ndSHD1 to a great extent when the number of labels is small.
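
The shrinking of the node mapping space can be illustrated by counting, for each minor
node, the data nodes that carry the same label. The sketch assumes that initial
compatibility is decided by label equality alone, which is a simplification of the node
compatible matrix; the function name candidate_counts is hypothetical and the labels are
drawn at random only for illustration.

```python
import random


def candidate_counts(minor_labels, data_labels):
    """For each minor node, count the data nodes that share its label.

    Assumption: two nodes are initially compatible iff their labels are equal.
    """
    freq = {}
    for lab in data_labels:
        freq[lab] = freq.get(lab, 0) + 1
    return [freq.get(lab, 0) for lab in minor_labels]


rng = random.Random(0)
minor_labels = list(range(6))  # uniquely labeled minor graph with 6 nodes
for num_labels in (10, 50, 200):
    # Data graph with 1000 nodes, each given one of num_labels labels.
    data_labels = [rng.randrange(num_labels) for _ in range(1000)]
    counts = candidate_counts(minor_labels, data_labels)
    print(num_labels, sum(counts) / len(counts))
```

With uniformly random labels, each minor node has on average about N2 / k candidates in
a data graph with N2 nodes and k labels, so going from 10 to 200 labels reduces the
initial candidate set per minor node by roughly a factor of 20.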

To examine the stability of ndSHD1 and ndSHD2, we record in Table 2 the
statistics, including the max, mean and standard deviation, of the sample data used in the
above experiments. We can see that in almost all the experiments, the standard deviation of
ndSHD2 is much less than that of ndSHD1, indicating that ndSHD2 is more stable
than ndSHD1.
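
For reference, the three statistics can be computed from a list of measured runtimes as
sketched below. The runtimes in the example are placeholders rather than measured data,
the function name runtime_summary is hypothetical, and the use of the sample (n - 1)
standard deviation is an assumption, since the variant used for Table 2 is not stated.

```python
import statistics


def runtime_summary(runtimes):
    """Return max, mean and sample standard deviation of a list of runtimes."""
    return {
        "max": max(runtimes),
        "mean": statistics.mean(runtimes),
        # Assumption: sample standard deviation (divides by n - 1).
        "std": statistics.stdev(runtimes),
    }


# Placeholder runtimes in seconds, for illustration only.
summary = runtime_summary([1.0, 2.0, 3.0, 4.0])
print(summary)
```

A smaller standard deviation at a comparable mean indicates that the runtime varies less
from one running case to another.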

In summary, in simple running cases, such as a small minor graph, a sparse data
graph, a small upper bound of path length, or a larger number of labels
in the data graph, both ndSHD2 and ndSHD1 are scalable and efficient. However, in more
complex cases, ndSHD2 outperforms ndSHD1 substantially in all aspects, including
scalability, efficiency and stability.

In Figure 11 we also illustrate the detailed searching procedure of two running
cases to show the superiority of ndSHD2 over ndSHD1. In both cases,
ndSHD2 runs much faster than ndSHD1. Since the mapping search has
been designed as a recursive procedure, statistics about the recursive depth of each
match (node-node match or edge-path match) are a significant indicator of the
performance of the algorithm. Hence, we recorded the recursive depth of all matches in the
searching procedure. Either a narrow exploring space or a small
average backtrack depth leads to a shorter runtime of the algorithm.
Hence, from Figure 11 we can easily see the great advantage of ndSHD2 over ndSHD1,
which can be attributed to the small width or small average backtrack depth of
the actual exploring space.
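
How such a depth trace can be collected is sketched below on a deliberately simplified
search. The sketch is not ndSHD1 or ndSHD2: it handles only the special case L = H = 1,
in which every minor edge must map to a single data edge, it uses no refinement step, and
all names in it are hypothetical. It records the recursion depth of every attempted
node-node match.

```python
def search_with_depth_trace(minor_adj, minor_labels, data_adj, data_labels):
    """Backtracking search that logs the depth of every attempted match.

    Simplification (assumption): label-preserving injective node mapping in
    which each minor edge maps to one data edge, i.e. the case L = H = 1.
    Returns (mapping or None, list of recursion depths).
    """
    order = list(minor_adj)  # fixed matching order of the minor nodes
    trace, mapping, used = [], {}, set()

    def feasible(u, v):
        if minor_labels[u] != data_labels[v] or v in used:
            return False
        # Already mapped neighbours of u must be mapped to neighbours of v.
        return all(mapping[w] in data_adj[v]
                   for w in minor_adj[u] if w in mapping)

    def extend(depth):
        if depth == len(order):
            return True
        u = order[depth]
        for v in data_adj:
            trace.append(depth)  # one attempted node-node match
            if feasible(u, v):
                mapping[u] = v
                used.add(v)
                if extend(depth + 1):
                    return True
                del mapping[u]   # backtrack
                used.discard(v)
        return False

    found = extend(0)
    return (dict(mapping) if found else None), trace


# A labeled triangle as minor graph and a small data graph containing it.
minor_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
minor_labels = {0: "a", 1: "b", 2: "c"}
data_adj = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
data_labels = {0: "a", 1: "b", 2: "c", 3: "a"}
mapping, trace = search_with_depth_trace(
    minor_adj, minor_labels, data_adj, data_labels)
print(mapping, trace)
```

Plotting the recorded depths against the order of the attempts gives a curve of the kind
discussed above: fewer attempts and shallower backtracking correspond to a smaller
explored space.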



Table 2. Statistics of sample data in 5 experiments
(S1: ndSHD1, S2: ndSHD2; runtimes in seconds; "-": not available)

            Exp1            Exp2_1          Exp2_2          Exp3            Exp4            Exp5
statistics  S1      S2      S1      S2      S1      S2      S1      S2      S1      S2      S1      S2
max         12.36   6.89    4.312   4.234   -       26.84   6509    11.41   33.44   12.56   107.9   6.297
mean        1.727   1.49    1.709   1.73    -       10.06   408     4.32    6.325   2.675   8.312   1.392
std         1.788   1.396   1.344   1.331   -       7.24    1530    3.797   13.34   4.937   24.89   1.909



5 Conclusions 

In this paper, we investigated the problem of node disjoint subgraph homeomorphism
determination and proposed two practical algorithms to address it,
in which many efficient heuristics are exploited to prune the futile search space.
The experimental results on synthetic data sets show that our algorithms are scalable
and efficient. To the best of our knowledge, no practical algorithm has previously been
available for node disjoint subgraph homeomorphism determination.

Fig. 11. Recursive depths of two running cases